nyc.jpg

An Analysis of Airbnb Data from New York City

Megha Guggari, Rohit Mandavia, Ngan Nguyen

Introduction

Airbnb is a popular tool that has made travel easy with simple/straight-forward room bookings all over the world (fun fact there are over 6 million Airbnb listings worldwide!) However, it is always a headache when trying to figure out the best place to book an Airbnb because you have to factor in things such as price, ratings, availability, and area. Our group chose to analyze Airbnb data from NYC after we realized that we were all travelling to NYC after our exams! We thought about how time-consuming it was to find the perfect Airbnb - one with great reviews, a great price, and one that was actually available! In general, we figured that travelling to NYC is pretty common, and we thought it would be useful to visualize things such as prices, ratings, and availabilities per neighborhood to make room bookings easier.

Airbnb releases open data for different cities. The data set we used can be found here: NYC Open Airbnb Data

In this tutorial, you will be able to see visualizations such as how price relates to neighborhood, how availabilities relate to areas in the city, and how ratings relate to price/neighborhoods, to name a few. This information will hopefully make it easier to make informed decisions about the best place to book an Airbnb!

Outline of project:

  1. Data Collection
  2. Data Preprocessing
  3. Data Visualization
  4. Classification/Prediction
  5. Conclusion

Required Libraries/Tools

For this project, we used the following packages:

  1. Matplotlib
  2. Pandas
  3. Folium
  4. Numpy
  5. Sklearn Linear Regression
  6. Math
  7. Seaborn
In [91]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas
import folium
from folium import plugins
from folium.plugins import HeatMap
import numpy as np
from sklearn import linear_model
import math
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

Part 1 - Data Collection

For our data collection, we chose a dataset from Kaggle that contained open data from NYC Airbnb.

Data explanation:

  • id: id of the Airbnb
  • name: description of the Airbnb
  • host_id: id of the host
  • host_name: Name of the host
  • neighbourhood_group: Neighbourhoods were grouped into 5 groups including:
    • Brooklyn
    • Manhattan
    • Bronx
    • Staten Island
    • Queens
  • neighbourhood: Specific neighbourhood name
  • latitude and longitude
  • room_type: type of Airbnb rental including:
    • Private Room
    • Entire Home/Apartment
    • Shared Room
  • price: Price of the Airbnb for one night
  • minimum_nights: minimum number of nights that Airbnb was available
  • number_of_reviews: number of reviews for the Airbnb
  • last_review: date of last review
  • reviews_per_month: Number of reviews per month (a ratio)
  • calculated_host_listings_count: amount of listings per host
  • availability_365: number of days available out of the year (out of 365 days)
In [92]:
data = pandas.read_csv("AB_NYC_2019.csv")
data
Out[92]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48890 36484665 Charming one bedroom - newly renovated rowhouse 8232441 Sabrina Brooklyn Bedford-Stuyvesant 40.67853 -73.94995 Private room 70 2 0 NaN NaN 2 9
48891 36485057 Affordable room in Bushwick/East Williamsburg 6570630 Marisol Brooklyn Bushwick 40.70184 -73.93317 Private room 40 4 0 NaN NaN 2 36
48892 36485431 Sunny Studio at Historical Neighborhood 23492952 Ilgar & Aysel Manhattan Harlem 40.81475 -73.94867 Entire home/apt 115 10 0 NaN NaN 1 27
48893 36485609 43rd St. Time Square-cozy single bed 30985759 Taz Manhattan Hell's Kitchen 40.75751 -73.99112 Shared room 55 1 0 NaN NaN 6 2
48894 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814 Christophe Manhattan Hell's Kitchen 40.76404 -73.98933 Private room 90 7 0 NaN NaN 1 23

48895 rows × 16 columns

Part 2 - Preprocessing

We chose to exclude some columns that were not very informative (ones that we did not think we needed in our analysis). The columns we chose to drop were as follows:

  1. id
  2. name
  3. host_id
  4. last_review
  5. calculated_host_listings_count

As our second preprocessing step, we chose to eliminate rows based on if the prices seemed unreasoable. This included if the prices were listed anywhere between 0-25 (a price that low seemed less common for NYC, especially), or if the prices were listed to be above 250 per night (as college students we wanted to keep prices that were more common).

In [93]:
data = data.drop(columns=['id', 'name', 'host_id', 'host_name', 'last_review', 'calculated_host_listings_count'])

data = data[data['price'] >= 25]
data = data[data['price'] <= 250]

data = data.dropna()
data
Out[93]:
neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews reviews_per_month availability_365
0 Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 0.21 365
1 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 0.38 355
3 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 4.64 194
4 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 0.10 0
5 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 0.59 129
... ... ... ... ... ... ... ... ... ... ...
48782 Manhattan Upper East Side 40.78099 -73.95366 Private room 129 1 1 1.00 147
48790 Queens Flushing 40.75104 -73.81459 Private room 45 1 1 1.00 339
48799 Staten Island Great Kills 40.54179 -74.14275 Private room 235 1 1 1.00 87
48805 Bronx Mott Haven 40.80787 -73.92400 Entire home/apt 100 1 2 2.00 40
48852 Brooklyn Bushwick 40.69805 -73.92801 Private room 30 1 1 1.00 1

35217 rows × 10 columns

Part 3 - Visualization

The different things we chose to visualize are as follows:

  1. Room Type vs Price
  2. Room Type vs Availability
  3. Map based on Rating vs Area
  4. Map based on Price vs Area
  5. Map based on Availability vs Area
In [94]:
#1 Room Type vs Price
data_by_roomtype = data.sort_values(["room_type"])
room_types = data_by_roomtype["room_type"].unique()


sums = {"entire": 0, "private": 0, "shared": 0}
tally = {"entire": 0, "private": 0, "shared": 0}
averages = {}

def generate_bar_plot(row, sortBy):
    global sums, tally
    if(row["room_type"]=="Entire home/apt"):
        sums["entire"]+=row[sortBy]
        tally["entire"]+=1
    elif(row["room_type"]=="Private room"):
        sums["private"]+=row[sortBy]
        tally["private"]+=1
    elif(row["room_type"] == "Shared room"):
        sums["shared"]+=row[sortBy]
        tally["shared"]+=1

# data_by_roomtype.apply(generate_bar_plot, axis=1)
for index, row in data_by_roomtype.iterrows():
    generate_bar_plot(row, "price") 

for k in sums:
    averages[k] = sums[k]/tally[k]
    
plt.bar(averages.keys(), averages.values())

plt.title("Rental Type vs Nightly Rate")
plt.ylabel("Nightly Rate")
plt.xlabel("Rental Type")
Out[94]:
Text(0.5, 0, 'Rental Type')
In [95]:
averages
Out[95]:
{'entire': 147.7998358348968,
 'private': 76.52815522800553,
 'shared': 56.74274905422446}

As expected, the rental price for booking an entire apartment or home is significantly more expensive than the other types of rooms.

In [96]:
#2 Room_type vs availability

for index, row in data_by_roomtype.iterrows():
    generate_bar_plot(row, "availability_365") 

for k in sums:
    averages[k] = sums[k]/tally[k]
    
plt.bar(averages.keys(), averages.values())
plt.title("Rental Type vs Availability")
plt.ylabel("Availability out of 365 days")
plt.xlabel("Rental Type")
Out[96]:
Text(0.5, 0, 'Rental Type')

Overall, it is apparent that booking an entire home or apartment is more available in NYC. This was a little surprising to us because our guess would have been that private room bookings would be more available (since we figured that travelling in smaller groups was more common).

In [97]:
averages
Out[97]:
{'entire': 125.42612570356472,
 'private': 95.98321625978812,
 'shared': 112.59268600252207}
In [98]:
#3 Map based on rating

# How to group by neighborhood group and then see the data?
m = folium.Map(location=[40.49, -74.24], zoom_start=11)
df = pandas.DataFrame(data)
df = df.sample(n = 1000)

for i in range(0, len(df)):

    if df.iloc[i]['number_of_reviews'] >= 0 and df.iloc[i]['number_of_reviews'] < 100:
        c = 'darkred'
    elif df.iloc[i]['number_of_reviews'] >= 100 and df.iloc[i]['number_of_reviews'] < 200:
        c = 'black'
    elif df.iloc[i]['number_of_reviews'] >= 200 and df.iloc[i]['number_of_reviews'] < 300:
        c = 'orange'
    elif df.iloc[i]['number_of_reviews'] >= 300 and df.iloc[i]['number_of_reviews'] < 400:
        c = 'white'
    elif df.iloc[i]['number_of_reviews'] >= 400 and df.iloc[i]['number_of_reviews'] < 500:
        c = 'green'
    elif df.iloc[i]['number_of_reviews'] >= 500 and df.iloc[i]['number_of_reviews'] < 600:
        c = 'blue'
    elif df.iloc[i]['number_of_reviews'] >= 600 and df.iloc[i]['number_of_reviews'] < 700:
        c = 'purple'
        
    folium.Circle(
        radius=5,
        location=[df.iloc[i]['latitude'], df.iloc[i]['longitude']],
        popup= df.iloc[i]['neighbourhood_group'],
        color=c,
        fill=False,
    ).add_to(m)
    
m 
Out[98]:

Overall, the number of reviews are between 0-100. There is not a huge apparent pattern between the location of the reviews and the number of reviews as we had thought there would be. However, from the sample, it seems that there is a higher chance of more reviews being farther away from the actual city, such as the Queens area. To see if this is actually the case, we decided to plot data specific to Queens and Brooklyn (which we have shown below).

In [99]:
# Map of ratings for "Queens" neighbourhood 

m = folium.Map(location=[40.49, -74.24], zoom_start=11)
df = pandas.DataFrame(data)
df = df.loc[df['neighbourhood_group'] == "Queens"]
df = df.sample(n=1000)

for i in range(0, len(df)):

    if df.iloc[i]['number_of_reviews'] >= 0 and df.iloc[i]['number_of_reviews'] < 100:
        c = 'darkred'
    elif df.iloc[i]['number_of_reviews'] >= 100 and df.iloc[i]['number_of_reviews'] < 200:
        c = 'black'
    elif df.iloc[i]['number_of_reviews'] >= 200 and df.iloc[i]['number_of_reviews'] < 300:
        c = 'orange'
    elif df.iloc[i]['number_of_reviews'] >= 300 and df.iloc[i]['number_of_reviews'] < 400:
        c = 'white'
    elif df.iloc[i]['number_of_reviews'] >= 400 and df.iloc[i]['number_of_reviews'] < 500:
        c = 'green'
    elif df.iloc[i]['number_of_reviews'] >= 500 and df.iloc[i]['number_of_reviews'] < 600:
        c = 'blue'
    elif df.iloc[i]['number_of_reviews'] >= 600 and df.iloc[i]['number_of_reviews'] < 700:
        c = 'purple'
        
    folium.Circle(
        radius=5,
        location=[df.iloc[i]['latitude'], df.iloc[i]['longitude']],
        popup= df.iloc[i]['neighbourhood_group'],
        color=c,
        fill=False,
    ).add_to(m)
    
m 
Out[99]:

Still as before, the majority of reviews are between 0-100. However, it seems like there is a little more green (reviews between 400-500) than the overall map. However, with just a sample of 1000 it is hard to generalize.

In [100]:
# Map of ratings for "Brooklyn" neighbourhood 

m = folium.Map(location=[40.49, -74.24], zoom_start=11)
df = pandas.DataFrame(data)
df = df.loc[df['neighbourhood_group'] == "Brooklyn"]
df = df.sample(n=1000)

for i in range(0, len(df)):

    if df.iloc[i]['number_of_reviews'] >= 0 and df.iloc[i]['number_of_reviews'] < 100:
        c = 'darkred'
    elif df.iloc[i]['number_of_reviews'] >= 100 and df.iloc[i]['number_of_reviews'] < 200:
        c = 'black'
    elif df.iloc[i]['number_of_reviews'] >= 200 and df.iloc[i]['number_of_reviews'] < 300:
        c = 'orange'
    elif df.iloc[i]['number_of_reviews'] >= 300 and df.iloc[i]['number_of_reviews'] < 400:
        c = 'white'
    elif df.iloc[i]['number_of_reviews'] >= 400 and df.iloc[i]['number_of_reviews'] < 500:
        c = 'green'
    elif df.iloc[i]['number_of_reviews'] >= 500 and df.iloc[i]['number_of_reviews'] < 600:
        c = 'blue'
    elif df.iloc[i]['number_of_reviews'] >= 600 and df.iloc[i]['number_of_reviews'] < 700:
        c = 'purple'
        
    folium.Circle(
        radius=5,
        location=[df.iloc[i]['latitude'], df.iloc[i]['longitude']],
        popup= df.iloc[i]['neighbourhood_group'],
        color=c,
        fill=False,
    ).add_to(m)
    
m 
Out[100]:

The majority of reviews are between 0-100. It is interesting to note that while Queens had a slightly higher chance of showing the green dots that showed 400-500 reviews, this area does not have any (indiciating that it might be less likely).

In [101]:
#4 Map based on Price vs Area

data_price_sample = data.sample(n=1000)

m = folium.Map(location=[40.7128, -74.0060], zoom_start=14)
heat_data = []
for index, row in data_price_sample.iterrows():
    loc_price = []
    lat = row['latitude']
    long = row['longitude']
    price = row['price']
    loc_price.append(lat)
    loc_price.append(long)
    loc_price.append(price)
    heat_data.append(loc_price)
    
HeatMap(heat_data, max_val=10000).add_to(m)
m
Out[101]:
In [102]:
#5 Map based on Available Airbnbs vs Area

data_price_sample = data.sample(n=1000)

m = folium.Map(location=[40.7128, -74.0060], zoom_start=14)
heat_data = []
for index, row in data_price_sample.iterrows():
    loc_availability = []
    lat = row['latitude']
    long = row['longitude']
    availability_365 = row['availability_365']
    loc_availability.append(lat)
    loc_availability.append(long)
    loc_availability.append(-availability_365)
    heat_data.append(loc_availability)
    
HeatMap(heat_data, max_val=0).add_to(m)
m
Out[102]:
In [103]:
map_hooray = folium.Map(location=[40.7, -73.9],
                    zoom_start = 10) 

df_acc = data 
# Ensure you're handing it floats
df_acc['latitude'] = df_acc['latitude'].astype(float)
df_acc['longitude'] = df_acc['longitude'].astype(float)

# Filter the DF for rows, then columns, then remove NaNs
heat_df = df_acc[df_acc['reviews_per_month']>5] # Reducing data size so it runs faster# Reducing data size so it runs faster
heat_df = heat_df[['latitude', 'longitude', 'price']]

# Create weight column, using date
heat_df = heat_df.dropna(axis=0, subset=['latitude','longitude', 'price'])

# List comprehension to make out list of lists
heat_data = [[[row['latitude'],row['longitude']] for index, row in heat_df[heat_df["price"]>20*i][heat_df["price"]<20*(i+1)].iterrows()] for i in range(0, 50)]

# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,max_opacity=0.7)
hm.add_to(map_hooray)
# Display the map
map_hooray

# heat_df
C:\Users\rohit\Anaconda3\lib\site-packages\ipykernel_launcher.py:17: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Out[103]:

The Heatmap/timeseries above plots the locations of airbnbs based on increasing price. The first frame plots Airbnbs with a nightly rate less than 20 dollars and then each subsequent frame goes up by 20 dollars. This helps us get an idea of where the cheaper and more expensive airbnbs might be.

Mean, Median, and Standard Deviation Analysis

We also chose to measure Mean, Median, and Standard Deviations to have a more simple yet informative statistical analysis to get more insight into some of the data including:

  1. Price
  2. Number of Reviews per Month
  3. Number of Reviews (overall)
  4. Availabilities
In [104]:
print("---------------Price-------------")
print("Mean: " + str(np.mean(data["price"])))
print("Median: " + str(np.median(data["price"])))
print("Standard Deviation: " + str(np.std(data["price"])))
---------------Price-------------
Mean: 110.60033506545135
Median: 99.0
Standard Deviation: 56.10137347569434

The mean price is around $112 per night which seems to be standard for a big city. As we saw in our analysis before, the prices of renting out an entire apartment or home is still more expensive than the other types of Airbnbs.

In [105]:
print("---------------Number of Reviews per Month-------------")
print("Mean: " + str(np.mean(data["reviews_per_month"])))
print("Median: " + str(np.median(data["reviews_per_month"])))
print("Standard Deviation: " + str(np.std(data["reviews_per_month"])))
---------------Number of Reviews per Month-------------
Mean: 1.3742527756481455
Median: 0.71
Standard Deviation: 1.6939495945203122
In [106]:
print("---------------Number of Reviews-------------")
print("Mean: " + str(np.mean(data["number_of_reviews"])))
print("Median: " + str(np.median(data["number_of_reviews"])))
print("Standard Deviation: " + str(np.std(data["number_of_reviews"])))
---------------Number of Reviews-------------
Mean: 29.962745265070847
Median: 10.0
Standard Deviation: 49.09259910719924

We expected there to be more reviews for each Airbnb (because we figured that travelling to NYC was common and that people would be leaving more informative reviews for each Airbnb). An interesting thing to test to find out more information about these reviews would be to determine if the reviews were mainly positive or negative, and to see if this had any relationship to the neighbourhood area or price of the Airbnb.

In [107]:
print("---------------Availability-------------")
print("Mean: " + str(np.mean(data["availability_365"])))
print("Median: " + str(np.median(data["availability_365"])))
print("Standard Deviation: " + str(np.std(data["availability_365"])))
---------------Availability-------------
Mean: 110.63318851690944
Median: 48.0
Standard Deviation: 128.35365520252995

Airbnb's are generally available for 1/3 of the entire year! As further analysis, it would be intersting to determine if there are certain times during the year where Airbnb's happen to be the most available as compared to other times. For example, are they more available during the holiday season?

More Statistical Analysis Below

In [108]:
neighbourhood_averages = {}

for row in data.iterrows():
    if row[1][5] in neighbourhood_averages:
        neighbourhood_averages[row[1][1]][0]+=1
        neighbourhood_averages[row[1][1]][1]+=row[1][9]
    else:
        neighbourhood_averages[row[1][1]] = [1, row[1][9]]
neighbourhood_averages
Out[108]:
{'Kensington': [1, 53],
 'Midtown': [1, 7],
 'Clinton Hill': [1, 31],
 'East Harlem': [1, 42],
 'Murray Hill': [1, 2],
 'Bedford-Stuyvesant': [1, 14],
 "Hell's Kitchen": [1, 98],
 'Upper West Side': [1, 15],
 'Chinatown': [1, 215],
 'South Slope': [1, 9],
 'West Village': [1, 8],
 'Williamsburg': [1, 1],
 'Fort Greene': [1, 29],
 'Chelsea': [1, 7],
 'Crown Heights': [1, 89],
 'Park Slope': [1, 8],
 'Windsor Terrace': [1, 0],
 'Inwood': [1, 11],
 'East Village': [1, 326],
 'Harlem': [1, 65],
 'Greenpoint': [1, 43],
 'Bushwick': [1, 1],
 'Lower East Side': [1, 13],
 'Prospect-Lefferts Gardens': [1, 25],
 'Long Island City': [1, 334],
 'Kips Bay': [1, 41],
 'SoHo': [1, 203],
 'Upper East Side': [1, 147],
 'Prospect Heights': [1, 89],
 'Washington Heights': [1, 68],
 'Woodside': [1, 365],
 'Flatbush': [1, 14],
 'Carroll Gardens': [1, 21],
 'Gowanus': [1, 364],
 'Flatlands': [1, 149],
 'Cobble Hill': [1, 43],
 'Flushing': [1, 339],
 'Sunnyside': [1, 188],
 'DUMBO': [1, 0],
 'St. George': [1, 201],
 'Highbridge': [1, 2],
 'Financial District': [1, 181],
 'Morningside Heights': [1, 3],
 'Jamaica': [1, 176],
 'Middle Village': [1, 161],
 'Ridgewood': [1, 133],
 'NoHo': [1, 36],
 'Ditmars Steinway': [1, 26],
 'Roosevelt Island': [1, 61],
 'Greenwich Village': [1, 169],
 'Little Italy': [1, 300],
 'East Flatbush': [1, 357],
 'Tompkinsville': [1, 84],
 'Astoria': [1, 165],
 'Eastchester': [1, 365],
 'Kingsbridge': [1, 84],
 'Boerum Hill': [1, 264],
 'Brooklyn Heights': [1, 0],
 'Two Bridges': [1, 161],
 'Queens Village': [1, 19],
 'Rockaway Beach': [1, 162],
 'Forest Hills': [1, 288],
 'Nolita': [1, 76],
 'Woodlawn': [1, 29],
 'University Heights': [1, 191],
 'Allerton': [1, 175],
 'East New York': [1, 361],
 'Theater District': [1, 39],
 'Concourse Village': [1, 364],
 'Sheepshead Bay': [1, 250],
 'Emerson Hill': [1, 38],
 'Fort Hamilton': [1, 322],
 'Bensonhurst': [1, 18],
 'Tribeca': [1, 66],
 'Shore Acres': [1, 0],
 'Sunset Park': [1, 311],
 'Concourse': [1, 20],
 'Gramercy': [1, 160],
 'Elmhurst': [1, 23],
 'Brighton Beach': [1, 353],
 'Jackson Heights': [1, 326],
 'Cypress Hills': [1, 82],
 'St. Albans': [1, 167],
 'Arrochar': [1, 81],
 'Rego Park': [1, 1],
 'Wakefield': [1, 62],
 'Clifton': [1, 312],
 'Bay Ridge': [1, 356],
 'Spuyten Duyvil': [1, 326],
 'Stapleton': [1, 179],
 'Briarwood': [1, 208],
 'Ozone Park': [1, 52],
 'Columbia St': [1, 13],
 'Vinegar Hill': [1, 303],
 'Mott Haven': [1, 40],
 'Longwood': [1, 14],
 'Canarsie': [1, 360],
 'Battery Park City': [1, 339],
 'East Elmhurst': [1, 358],
 'New Springville': [1, 4],
 'Morris Heights': [1, 65],
 'Arverne': [1, 362],
 'Gravesend': [1, 222],
 'Mariners Harbor': [1, 140],
 'Concord': [1, 68],
 'Borough Park': [1, 25],
 'Downtown Brooklyn': [1, 6],
 'Flatiron District': [1, 6],
 'Civic Center': [1, 327],
 'Port Morris': [1, 88],
 'Fieldston': [1, 192],
 'Kew Gardens': [1, 305],
 'Midwood': [1, 86],
 'Mount Eden': [1, 0],
 'City Island': [1, 18],
 'Glendale': [1, 90],
 'Red Hook': [1, 325],
 'Richmond Hill': [1, 365],
 'Maspeth': [1, 21],
 'Port Richmond': [1, 365],
 'Williamsbridge': [1, 47],
 'Soundview': [1, 365],
 'Woodhaven': [1, 359],
 'Co-op City': [1, 365],
 'Stuyvesant Town': [1, 321],
 'Parkchester': [1, 68],
 'North Riverdale': [1, 174],
 'Dyker Heights': [1, 89],
 'Bronxdale': [1, 194],
 'Riverdale': [1, 52],
 'Kew Gardens Hills': [1, 160],
 'Bay Terrace': [1, 169],
 'Norwood': [1, 271],
 'Claremont Village': [1, 88],
 'Fordham': [1, 175],
 'Bayswater': [1, 90],
 'Navy Yard': [1, 0],
 'Brownsville': [1, 35],
 'Eltingville': [1, 291],
 'Mount Hope': [1, 167],
 'Clason Point': [1, 86],
 'Lighthouse Hill': [1, 71],
 'Springfield Gardens': [1, 89],
 'Howard Beach': [1, 78],
 'Belle Harbor': [1, 52],
 'Jamaica Estates': [1, 43],
 'Van Nest': [1, 342],
 'Bellerose': [1, 342],
 'Bayside': [1, 345],
 'Morris Park': [1, 343],
 'West Brighton': [1, 80],
 'College Point': [1, 212],
 'Far Rockaway': [1, 107],
 'South Ozone Park': [1, 327],
 'Tremont': [1, 146],
 'Corona': [1, 337],
 'Great Kills': [1, 87],
 'Manhattan Beach': [1, 215],
 'Marble Hill': [1, 52],
 'Dongan Hills': [1, 310],
 'Fresh Meadows': [1, 178],
 'East Morrisania': [1, 89],
 'Hunts Point': [1, 59],
 'Pelham Bay': [1, 336],
 'Randall Manor': [1, 342],
 'West Farms': [1, 310],
 'Silver Lake': [1, 0],
 'Laurelton': [1, 70],
 'Grymes Hill': [1, 44],
 'Holliswood': [1, 135],
 'Pelham Gardens': [1, 354],
 'Rosedale': [1, 89],
 'Edgemere': [1, 363],
 'New Brighton': [1, 10],
 'Baychester': [1, 46],
 'Melrose': [1, 0],
 'Sea Gate': [1, 180],
 'Bergen Beach': [1, 164],
 'Cambria Heights': [1, 151],
 'Richmondtown': [1, 300],
 'Throgs Neck': [1, 365],
 'Howland Hook': [1, 363],
 'Schuylerville': [1, 343],
 'Coney Island': [1, 129],
 "Prince's Bay": [1, 66],
 'South Beach': [1, 176],
 'Bath Beach': [1, 90],
 'Midland Beach': [1, 231],
 'Jamaica Hills': [1, 139],
 'Castleton Corners': [1, 40],
 'Oakwood': [1, 364],
 'Castle Hill': [1, 42],
 'Douglaston': [1, 143],
 'Huguenot': [1, 259],
 'Whitestone': [1, 5],
 'Edenwald': [1, 34],
 'Belmont': [1, 359],
 'Grant City': [1, 188],
 'Westerleigh': [1, 36],
 'Tottenville': [1, 299],
 'Morrisania': [1, 90],
 'Bay Terrace, Staten Island': [1, 0],
 'Westchester Square': [1, 355],
 'Little Neck': [1, 88],
 'Rosebank': [1, 179],
 'Mill Basin': [1, 322],
 'Hollis': [1, 89],
 'Arden Heights': [1, 55],
 "Bull's Head": [1, 362],
 'Olinville': [1, 188],
 'Neponsit': [1, 44],
 'Graniteville': [1, 0],
 'Unionport': [1, 365],
 'Rossville': [1, 59],
 'Breezy Point': [1, 59],
 'Willowbrook': [1, 351],
 'New Dorp Beach': [1, 307],
 'Todt Hill': [1, 36]}
In [109]:
data
Out[109]:
neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews reviews_per_month availability_365
0 Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 0.21 365
1 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 0.38 355
3 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 4.64 194
4 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 0.10 0
5 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 0.59 129
... ... ... ... ... ... ... ... ... ... ...
48782 Manhattan Upper East Side 40.78099 -73.95366 Private room 129 1 1 1.00 147
48790 Queens Flushing 40.75104 -73.81459 Private room 45 1 1 1.00 339
48799 Staten Island Great Kills 40.54179 -74.14275 Private room 235 1 1 1.00 87
48805 Bronx Mott Haven 40.80787 -73.92400 Entire home/apt 100 1 2 2.00 40
48852 Brooklyn Bushwick 40.69805 -73.92801 Private room 30 1 1 1.00 1

35217 rows × 10 columns

In [110]:
manhattan_mean = data[data["neighbourhood_group"]=="Manhattan"].price.mean()
brooklyn_mean = data[data["neighbourhood_group"]=="Brooklyn"].price.mean()
queens_mean = data[data["neighbourhood_group"]=="Queens"].price.mean()
staten_mean = data[data["neighbourhood_group"]=="Staten Island"].price.mean()
bronx_mean = data[data["neighbourhood_group"]=="Bronx"].price.mean()
In [111]:
def replace_neighbourhood(n):
    if(n == "Manhattan"): 
        return manhattan_mean
    elif(n == "Brooklyn"): 
        return brooklyn_mean
    elif(n == "Queens"): 
        return queens_mean
    elif(n == "Staten Island"): 
        return staten_mean
    elif(n == "Bronx"): 
        return bronx_mean
    
def replace_type(t):
    if(t == "Entire home/apt"):
        return averages["entire"]
    elif(t == "Private room"):
        return averages["private"]
    elif(t == "Shared room"):
        return averages["shared"]
In [112]:
pre = data.drop(['minimum_nights','availability_365'], axis=1)
pre.neighbourhood_group = pre["neighbourhood_group"].apply(replace_neighbourhood)
pre.room_type = pre["room_type"].apply(replace_type)
In [113]:
neighbourhood_averages2 = {}
for k in neighbourhood_averages:
    neighbourhood_averages2[k] = neighbourhood_averages[k][1]/neighbourhood_averages[k][0]
In [114]:
def neighbourhood_change(n):
    global neighbourhood_averages2
    return neighbourhood_averages2[n]

pre.neighbourhood = pre["neighbourhood"].apply(neighbourhood_change)

pre
Out[114]:
neighbourhood_group neighbourhood latitude longitude room_type price number_of_reviews reviews_per_month
0 101.481853 53.0 40.64749 -73.97237 95.983216 149 9 0.21
1 131.420141 7.0 40.75362 -73.98377 125.426126 225 45 0.38
3 101.481853 31.0 40.68514 -73.95976 125.426126 89 270 4.64
4 131.420141 42.0 40.79851 -73.94399 125.426126 80 9 0.10
5 131.420141 2.0 40.74767 -73.97500 125.426126 200 74 0.59
... ... ... ... ... ... ... ... ...
48782 131.420141 147.0 40.78099 -73.95366 95.983216 129 1 1.00
48790 84.539873 339.0 40.75104 -73.81459 95.983216 45 1 1.00
48799 83.415282 87.0 40.54179 -74.14275 95.983216 235 1 1.00
48805 73.342012 40.0 40.80787 -73.92400 125.426126 100 2 2.00
48852 101.481853 1.0 40.69805 -73.92801 95.983216 30 1 1.00

35217 rows × 8 columns

Violin Plot

In [115]:
violin = sns.violinplot(data=data, x='neighbourhood_group', y='price')
violin.set_title('Density and distribution of prices for each neighberhood_group')
Out[115]:
Text(0.5, 1.0, 'Density and distribution of prices for each neighberhood_group')

The Violin plots above show the price distribution of Airbnbs in the 5 bouroughs of NYC. All of the bouroughs, except for Manhattan, are clearly unimodal with the modes falling in the 40-60 dollar range. Manhattan, however, has a much more even distribution suggesting that the prices are going to be much more expensive, on average, than in the other bouroughs. It is important that our model takes this into account!

Part 4 - Classification

In [116]:
data_x = pre[["neighbourhood", "neighbourhood_group", "latitude", "longitude"]]
y = pre.price

X_train, X_test, y_train, y_test = train_test_split(data_x, y, test_size=0.2)


reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

predicts = reg.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(mean_absolute_error(y_test,predicts),np.sqrt(mean_squared_error(y_test, predicts)),r2_score(y_test,predicts),))
        Mean Absolute Error: 42.6601773417156
        Root Mean Squared Error: 52.28324054988423
        R2 Score: 0.13169825206150432
     

For the first linear regression we used 4 features. These are all the features that are related to geographic location since location is often a huge factor when it comes to determining a price for a home or apartment. While the MAE and RMSE were very reasonable but the R^2 score was only 0.14 which is not desireable. This suggested we needed to add more features if we are to continue with linear regression

In [117]:
data_x = pre[["neighbourhood", "latitude", "longitude"]]
y = pre.price

X_train, X_test, y_train, y_test = train_test_split(data_x, y, test_size=0.2)


reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

predicts = reg.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(mean_absolute_error(y_test,predicts),np.sqrt(mean_squared_error(y_test, predicts)),r2_score(y_test,predicts),))

a = reg.coef_
i = reg.intercept_
        Mean Absolute Error: 44.00267826102187
        Root Mean Squared Error: 53.824499597096484
        R2 Score: 0.09739233274730563
     

We first tried to remove the neighbourhood features since neighbourhood and neighbourhood_group are related. In fact, the neighbourhood group is just a more precise version of the neighbourhood. However, the R^2 value only decreased so this was not a good move. From then on we decided to include both the neithbourhood and the neighbourhood group.

In [118]:
data_x = pre.drop(["price"], axis=1)
y = pre.price

X_train, X_test, y_train, y_test = train_test_split(data_x, y, test_size=0.2)


reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

predicts = reg.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(mean_absolute_error(y_test,predicts),np.sqrt(mean_squared_error(y_test, predicts)),r2_score(y_test,predicts),))
        Mean Absolute Error: 31.43313104269536
        Root Mean Squared Error: 40.93980434797532
        R2 Score: 0.46770082510128985
     

Next, we simply tried to include all the features as there is likely some correlations between the availability, reviews, etc... on the price of an airbnb. This would allow the linear regression to have as meany features as possible and adjust the weights based on them. We were afraid that including all the features may not help more than the geographic features alone but the R^2 value jumped all the way up to 0.48.

In [119]:
data_x = pre.drop(["price", "number_of_reviews", "reviews_per_month"], axis=1)
y = pre.price

X_train, X_test, y_train, y_test = train_test_split(data_x, y, test_size=0.2)

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

predicts = reg.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(mean_absolute_error(y_test,predicts),np.sqrt(mean_squared_error(y_test, predicts)),r2_score(y_test,predicts),))

a = reg.coef_
i = reg.intercept_
        Mean Absolute Error: 32.01262714429472
        Root Mean Squared Error: 41.337224159693264
        R2 Score: 0.4672694973705521
     

Lastly, we felt that the review based features may not be accurate predictors considering that an airbnb may have a lot of reviews for bring really good or for being really bad. This would suggest that it would not be a great predictor so we tried a linear regression without these features and the statistics essentially remained the same. This suggests that number and frequency of reviews are not a great feature to use for this particular dataset.

Conclusion

Through our data and analysis we can infer that the price of an Airbnb room is higher in Manhattan than the other 4 neighborhoods (Brooklyn, Queens, Staten Island, and Bronx). This could be due to higher tourist activities in Manhattan.

Our data and analysis can be used to help people, such as travelers, find places to stay in New York City that meet their preferences in terms of neighborhood, availability, and price. It can also be used to predict the price of an Airbnb room that meets their preference.

For future research we plan on predicting the popularity of an Airbnb room based on its attributes such as room type, neighborhood and availability. This analysis could be beneficial for audiences that want to list their homes in the New York City area and want to know the success rate.

To improve on our data and analysis we can add more attributes that would provide better insights and linear regression predictions. For example, we can include crime rate in each neighborhood group and neighborhood to see if it has an effect on the price and popularity of an Airbnb room. We can also include the exact square footage of an Airbnb room to better predict it's price value.

In [ ]: